In the recent past, convolutional neural networks (CNNs) have seen a
resurgence and have performed extremely well on vision tasks. Visually, the
model resembles a series of layers, each of which is processed by a function
to form the next layer. It is argued that a CNN first models low-level features
such as edges and joints and then expresses higher-level features as a
composition of these low-level features. The aim of this paper is to detect
multi-view faces using a deep convolutional neural network (DCNN).
Detection and retrieval of faces are implemented with the
help of direct visual matching technology. Further, a probabilistic measure
of the similarity of the face images is computed using Bayesian analysis.
The experiment detects faces with ±90 degree out-of-plane rotations. A fine-tuned
AlexNet is used to detect pose-invariant faces. For this work, we extracted
training examples from the AFLW (Annotated Facial Landmarks in the Wild)
dataset, which contains 21K images with 24K face annotations.
1. INTRODUCTION

Face detection can be defined as the process of extracting faces from given images. Hence, the
system should positively identify a certain region as a face. According to Yang et al. and Erik Hjelmas et al.,
face detection is the process of finding the regions of an input image where faces are present [1-2]. A lot of
work has been done on detecting frontal, in-plane faces in still images, including against complex backgrounds [3].
With advances in information technology and computational power, computers have become more
interactive with humans. This human-computer interface (HCI) is realized mostly via traditional devices such as
the mouse, keyboard, and display. One of the most important media for it is the face and facial expression [4], [5].

There are several algorithms that address frontal face detection [6], but only a small number of
techniques address non-frontal or multi-view face detection [7]. Most of these techniques
scan the image with a sub-window and then classify each sub-window as a face or non-face pattern.
Statistical learning methods are used for classification. This is because the pixels on faces are highly
correlated, while non-face sub-windows have less regularity. A nonlinear classifier is
necessary because of the large variations in lighting and illumination, facial expression, pose, and
appearance. Examples of such techniques are neural networks [8] and Support Vector Machines [9]. One such approach used
two neural network classifiers, the first for pose estimation and the second for conventional face detection.
Schneiderman et al. [10] proposed a technique that detects faces with out-of-plane rotation. In [11], Jones
and Viola [12] extend this framework. Convolutional neural networks (CNNs) [13] are the most recent
addition, used within a cascade framework that quickly rejects background regions. Research on multi-view
face detection using CNNs is growing rapidly [14, 15], driven by the success of CNNs in many computer vision
problems.

CNNs can be visualized as a series of layers. The initial set of layers responds to discriminative
low-level patterns. The next set of layers responds to intermediate patterns, which are composed of low-level
patterns, and so on. The inspiration for CNNs, and for neural networks in general, has been the biological
understanding of the brain. It has been known for quite some time that the brain is made up of roughly 100 billion
neurons and that these neurons are densely connected. CNNs mimic neurons and their connections. A layer
in a CNN is made up of m x n neurons, and the neurons of neighbouring layers are connected. In this section,
we describe neurons, their connections, and the various types of layers that modern CNNs have.
Zhang et al. studied enhancing multi-view face detection with a multi-task deep CNN [16]. Farfade
et al. [17] examined multi-view face detection using a deep CNN.
Parkhi et al. addressed the recognition of a face from either a single photograph or a set of faces tracked in a video [18].
Li et al. analyzed a CNN cascade for face detection [19].

Face detection is a well-studied problem in computer vision. Contemporary face detectors
can effortlessly identify near-frontal faces. The difficulties in face detection come from two aspects:
the large search space of possible face sizes and positions, and the large visual variation of human faces in a
cluttered environment. The former imposes a time-efficiency requirement, while the latter requires a
face detector to accurately solve a binary classification problem. In uncontrolled settings,
extreme illumination and exaggerated expressions can lead to large visual differences in
the appearance of the face and affect the robustness of the face detector [20]. It is therefore important to develop
a method that detects faces properly, as pointed out in [21]. Accordingly, this research
concentrates on multi-view face detection using a deep convolutional neural network.

In this work, we present a novel deep convolutional neural network (DCNN) architecture
for multi-view face detection. In most previous work, feature selection was manual, that is, handcrafted,
whereas in a convolutional neural network features are selected automatically, even under complex visual variations.
CNNs need substantial computational power because they require exhaustive scanning of the entire
image at multiple scales. Hence, to speed up detection, we propose a CNN
cascade structure that rejects false detections very quickly in the early stages. The most prominent contributions
of our work are as follows:

1. We designed a CNN cascade for fast face detection.

2. Our architecture is able to detect pose-invariant faces in changing environments.

3. Our design is able to handle multi-resolution images.

4. We improve the state-of-the-art performance on the face detection data set and benchmark.


2. IMPLEMENTATION METHOD

In the implementation, face detection and image retrieval are attained with the help
of direct visual matching technology. A probabilistic measure of the resemblance among the face
images is computed on the basis of Bayesian analysis. After
this, a neural network is developed and trained in order to enhance the outcome of the Bayesian
analysis. Next, training and verification are adopted to test other images that involve similar
face features. Deep learning is guided by supervisory signals. The identification signal is:


$$\mathrm{Ident}(f, t, \theta_{id}) = -\sum_{i=1}^{n} p_i \log \hat{p}_i = -\log \hat{p}_t \tag{1}$$


where $f$ is the feature vector, $t$ is the target class, $\theta_{id}$ denotes the softmax layer parameters, $p_i$ is the target
probability distribution ($p_i = 0$ for all $i$ except $p_t = 1$), and $\hat{p}_i$ is the predicted probability distribution.
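
The identification signal of Eq. (1) can be illustrated with a minimal sketch. It assumes that the softmax layer has already produced one raw score per identity; the function names and the example scores are hypothetical and are not part of the original implementation.

```python
import math

def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ident_loss(scores, target):
    """Eq. (1): minus the log of the probability predicted for the target class."""
    p_hat = softmax(scores)
    return -math.log(p_hat[target])

# Hypothetical scores for three identities; the true identity is index 0.
print(ident_loss([2.0, 0.5, -1.0], target=0))
```

Because the target distribution is one-hot, the sum over all classes in Eq. (1) collapses to the single term for the target class, which is what the function computes.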
The verification signal regularizes the features and reduces intra-personal variations; it follows the loss given by Hadsell et al. [22]:

$$\mathrm{Verif}(f_i, f_j, y_{ij}, \theta_{ve}) = \begin{cases} \frac{1}{2}\lVert f_i - f_j \rVert_2^2 & \text{if } y_{ij} = 1 \\ \frac{1}{2}\max\left(0,\; m - \lVert f_i - f_j \rVert_2\right)^2 & \text{if } y_{ij} = -1 \end{cases} \tag{2}$$

where $f_i$ and $f_j$ are the feature vectors of the two compared face images, $m$ is the margin, $\theta_{ve} = \{w, b\}$ denotes the learnable scaling and shifting parameters, $\sigma$ denotes the sigmoid function,
and $y_{ij}$ is the binary target indicating whether the two compared face images belong to the same identity. Further, the
convolution operation is represented as:

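$$y^{j} = \max\left(0,\; b^{j} + \sum_{i} k^{ij} * x^{i}\right) \tag{4}$$


where $x^{i}$ is the $i$-th input map, $y^{j}$ is the $j$-th output map, $k^{ij}$ is the convolution kernel between the $i$-th input map and the $j$-th output map, $b^{j}$ is the bias of the $j$-th output map, and $*$ denotes convolution. Max pooling
is given by: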


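$$y^{i}_{j,k} = \max_{0 \le m,\, n < s} \left\{ x^{i}_{j \cdot s + m,\; k \cdot s + n} \right\} \tag{5}$$


where each neuron in the $i$-th output map $y^{i}$ pools over an $s \times s$ non-overlapping region of the $i$-th input map $x^{i}$.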
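
The convolution of Eq. (4) and the max pooling of Eq. (5) can be sketched in plain Python as follows. This is an illustrative sketch only: the feature maps are nested lists, the convolution is computed as a "valid" cross-correlation without kernel flipping (an assumption that matches common CNN practice), and all function names are hypothetical.

```python
def conv_relu(inputs, kernels, bias):
    """Eq. (4) for one output map: max(0, bias + sum_i kernels[i] * inputs[i]).

    inputs  : list of 2-D input maps x^i (lists of rows)
    kernels : one 2-D kernel k^{ij} per input map
    """
    kh, kw = len(kernels[0]), len(kernels[0][0])
    out_h = len(inputs[0]) - kh + 1
    out_w = len(inputs[0][0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            total = bias
            for x, k in zip(inputs, kernels):
                for dr in range(kh):
                    for dc in range(kw):
                        total += k[dr][dc] * x[r + dr][c + dc]
            out[r][c] = max(0.0, total)  # rectification
    return out

def max_pool(fmap, s):
    """Eq. (5): maximum over non-overlapping s x s regions."""
    return [
        [max(fmap[j * s + m][k * s + n] for m in range(s) for n in range(s))
         for k in range(len(fmap[0]) // s)]
        for j in range(len(fmap) // s)
    ]
```

For a 4 x 4 map and s = 2, `max_pool` returns a 2 x 2 map in which each entry is the largest value of one 2 x 2 block.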
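The layer connected to both the third and fourth convolutional layers computes:

$$y_j = \max\left(0,\; \sum_{i} x^{1}_{i} w^{1}_{i,j} + \sum_{i} x^{2}_{i} w^{2}_{i,j} + b_j\right) \tag{6}$$


where $x^{1}$, $w^{1}$, $x^{2}$, $w^{2}$ represent the neurons and weights in the third and fourth convolutional layers, respectively. The output of the
ConvNet is an n-way softmax that predicts the probability distribution over n unique identities: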


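$$y_i = \frac{\exp(y'_i)}{\sum_{j=1}^{n} \exp(y'_j)} \tag{7}$$

where $y'_j$ is the input received by the $j$-th output neuron.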
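
As an illustration of Eq. (6) and Eq. (7), the following sketch evaluates a hidden layer fed by two sets of neurons and then applies an n-way softmax. The weight layout (`w[i][j]` connects input neuron i to hidden neuron j) and the function names are assumptions made for this example.

```python
import math

def hidden_layer(x1, w1, x2, w2, b):
    """Eq. (6): rectified sum of the contributions of two convolutional layers."""
    out = []
    for j in range(len(b)):
        total = b[j]
        total += sum(x1[i] * w1[i][j] for i in range(len(x1)))
        total += sum(x2[i] * w2[i][j] for i in range(len(x2)))
        out.append(max(0.0, total))
    return out

def softmax(scores):
    """Eq. (7): normalized exponentials over the n identities."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Subtracting the maximum score before exponentiation does not change the result of Eq. (7), but it avoids overflow for large scores.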


DCNNs are mostly adopted for classification and are also adopted for detecting and recognizing faces.
Most of them follow a cascade strategy and take patches at various locations and scales
as inputs.
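
The verification signal of Eq. (2) can be sketched in the same way. The sketch assumes the squared L2 form of the loss attributed to Hadsell et al. [22], with a squared hinge on the margin for pairs of different identities; the margin value and the function name are hypothetical.

```python
import math

def verif_loss(f_i, f_j, y_ij, m=1.0):
    """Eq. (2): verification loss on a pair of feature vectors.

    y_ij = 1 for the same identity, -1 for different identities.
    m is the margin; its default value here is an assumption.
    """
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))
    if y_ij == 1:
        return 0.5 * dist ** 2            # pull same-identity features together
    return 0.5 * max(0.0, m - dist) ** 2  # push different identities apart up to m
```

Pairs of different identities that are already farther apart than the margin contribute zero loss, so only hard pairs influence training.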


2.1. Proposed algorithm for deep convolutional neural network (DCNN) 


This work develops an algorithm for multi-view face detection with the help of a
deep convolutional neural network. The steps of the implementation are described below:
Step 1: Face detection and image retrieval are attained with the help of
direct visual matching technology, which matches faces directly. This technology makes use of a similarity
metric between images, which can be either normalized correlation or Euclidean distance; this
corresponds to the approach called template matching. The similarity between two images is measured
through a similarity measure denoted by $S(I_a, I_b)$, where $I_a$ and $I_b$ are the two images being
compared.
Step 2: The next step is measuring the probabilistic similarity based on $\Delta$, the intensity difference between
the two images, given by $\Delta = I_a - I_b$. This calculation of resemblance among
the face images is conducted on the basis of Bayesian analysis to achieve face
detection.
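Step 3: The probabilistic calculation of resemblance also supports multiple face detection. The
various types of image variation are characterized by statistical analysis. Under this, the similarity
measure $S(I_a, I_b)$ between the pair of images $I_a$ and $I_b$ is given in terms of the a posteriori probability of
intrapersonal variation:


$$S(I_a, I_b) = P(\Omega_I \mid \Delta) = \frac{P(\Delta \mid \Omega_I)\,P(\Omega_I)}{P(\Delta \mid \Omega_I)\,P(\Omega_I) + P(\Delta \mid \Omega_E)\,P(\Omega_E)} \tag{8}$$


where $\Delta = I_a - I_b$, $\Omega_I$ denotes intrapersonal variation (images of the same person), and $\Omega_E$ denotes extrapersonal variation (images of different people). If the multi-view face detection is done for a single person, then $P(\Omega_I \mid \Delta) > P(\Omega_E \mid \Delta)$ or, equivalently,
$S(I_a, I_b) > 1/2$.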

Step 4: Further, a neural network is developed and trained in order to enhance the outcome of the
Bayesian analysis.
Step 5: Next, training and verification are adopted to test other images that involve similar face
features. The code is implemented step by step as follows:

a. First, the DCNN object is created.

b. Second, the graphical user interface is initialized.

c. Then the MCR (misclassification rate) calculation is initialized and a plot of the MCR is created, showing the
current epoch, iteration, RMSE (root mean square error), and MCR value of the image data.

d. The training data is loaded.

e. The training data is pre-processed, errors are deleted, and then the image data is simulated.

f. After the simulation, the multiple faces detected in the image are shown in red rectangular boxes.
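
Steps 1 to 3 above can be sketched as follows. The template-matching metrics follow their standard definitions. The Bayesian similarity, however, needs class-conditional densities that the text does not specify, so the sketch assumes isotropic zero-mean Gaussian densities with hypothetical standard deviations `sigma_intra` and `sigma_extra` for same-person and different-person differences; it is an illustration, not the implementation used in this work.

```python
import math

def euclidean_distance(img_a, img_b):
    """Step 1: Euclidean distance between two flattened images."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(img_a, img_b)))

def normalized_correlation(img_a, img_b):
    """Step 1: normalized correlation of two flattened images."""
    mean_a = sum(img_a) / len(img_a)
    mean_b = sum(img_b) / len(img_b)
    da = [a - mean_a for a in img_a]
    db = [b - mean_b for b in img_b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return num / den if den else 0.0

def log_gaussian(delta, sigma):
    """Log-density of an isotropic zero-mean Gaussian (assumed model)."""
    sq = sum(d * d for d in delta)
    return -0.5 * len(delta) * math.log(2 * math.pi * sigma ** 2) - sq / (2 * sigma ** 2)

def bayes_similarity(img_a, img_b, sigma_intra, sigma_extra, prior_intra=0.5):
    """Steps 2 and 3: posterior probability that the two images show the same person."""
    delta = [a - b for a, b in zip(img_a, img_b)]  # Step 2: intensity difference
    log_same = math.log(prior_intra) + log_gaussian(delta, sigma_intra)
    log_diff = math.log(1 - prior_intra) + log_gaussian(delta, sigma_extra)
    z = log_diff - log_same
    if z > 700:  # exp(z) would overflow; the posterior is effectively zero
        return 0.0
    return 1.0 / (1.0 + math.exp(z))

# Identical images give a similarity above 1/2 under these assumptions.
print(bayes_similarity([10, 20, 30, 40], [10, 20, 30, 40], 1.0, 10.0))
```

The posterior is evaluated in log space so that the very small density values that arise for high-dimensional images do not underflow.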

The screenshot of the CNN training progress is shown in Figure 1. The plot of RMSE during training and
the plot of MCR are also shown in the CNN training progress. The CNN is trained to
minimize the risk of the softmax loss function given by:


$$R = \sum_{x_i \in \mathcal{B}} \log\left[\mathrm{prob}(x_i \mid y_i)\right] \tag{9}$$


Here $\mathcal{B}$ represents the batch used in the current iteration of stochastic gradient descent and $y_i$ is the label of
sample $x_i$. The Hessian calculation is then started. In this run, the current epoch is 3.00, the iteration value
is 759.00, the RMSE value is 0.18, and the MCR value is 0.90. Here
the theta used is 8.000e°°. The plot of RMSE in training appears as zigzag lines, and the plot of MCR in training
appears as curved lines.
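
The quantities monitored during training can be illustrated with a short sketch. RMSE and MCR follow their standard definitions. The batch risk is written here as the negative sum of the log-probabilities assigned to the true labels, so that a lower value is better; this sign convention and all function names are assumptions of the example.

```python
import math

def rmse(outputs, targets):
    """Root mean square error between network outputs and targets."""
    return math.sqrt(sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets))

def mcr(predicted, actual):
    """Misclassification rate: fraction of samples with a wrong label."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)

def batch_risk(probs, labels):
    """Softmax risk over one batch.

    probs[i][c] is the predicted probability of class c for sample i,
    labels[i] is the true class of sample i.
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, labels))

# Hypothetical batch of two samples with two classes (face / non-face).
print(batch_risk([[0.9, 0.1], [0.2, 0.8]], [0, 1]))
```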


[Plot of RMSE in training. Status panel: current epoch 2.00, iteration 1518.00, RMSE 0.22, theta 2.000e-04.]

Figure 1. DCNN training process


2.2. CNN Structure 
The CNN structure adopted in the present study is shown in Figure 2. It consists of the 12-net,
24-net, and 48-net CNN structures.

Figure 2. CNN structures of the (a) 12-net, (b) 24-net and (c) 48-net

a. 12-net CNN

It is the first CNN, which scans the test image quickly in the test pipeline. The 12-net CNN is applied to
an image of dimensions w x h by densely scanning it with 12x12 detection windows at a pixel spacing of 4.
This results in a map of size:


$$\left(\left\lfloor \frac{w - 12}{4} \right\rfloor + 1\right) \times \left(\left\lfloor \frac{h - 12}{4} \right\rfloor + 1\right) \tag{10}$$


Each point on this map corresponds to a 12x12 detection window on the test image. The minimum
acceptable face size for a test image is T. First, an image pyramid is built from the test image
in order to cover faces at varied scales. Each level of the image pyramid is resized by
12/T and serves as an input image for the 12-net CNN. Under this structure, 2500 detection windows
are created, as shown in Figure 1.
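
The arithmetic of the 12-net scan can be checked with a small sketch. The map size follows from the 12x12 window and the pixel spacing of 4 stated above. The shrink factor between pyramid levels (`step`) is not given in the text and is an assumption, as are the function names and the example image size.

```python
def score_map_size(w, h, window=12, stride=4):
    """Number of detection windows along each axis for a w x h image."""
    return ((w - window) // stride + 1, (h - window) // stride + 1)

def pyramid_sizes(w, h, min_face, window=12, step=0.8):
    """Sizes of the pyramid levels fed to the 12-net.

    The first level is resized by window / min_face; each further level is
    shrunk by the assumed factor `step` until it is smaller than the window.
    """
    sizes = []
    scale = window / min_face
    while True:
        level = (round(w * scale), round(h * scale))
        if min(level) < window:
            break
        sizes.append(level)
        scale *= step
    return sizes

# Hypothetical 800 x 600 test image with minimum face size T = 40.
first_level = pyramid_sizes(800, 600, 40)[0]
cols, rows = score_map_size(*first_level)
print(first_level, cols * rows)
```

For this hypothetical input, the first pyramid level is 240 x 180 and produces 58 x 43 = 2494 detection windows, which is of the same order as the 2500 windows mentioned above.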

b. 12-Calibration-net 

For bounding box calibration, the 12-calibration-net is used. The detection
window is described by (x, y, w, h), where x and y are the coordinates of the window and w and h are its width and height, respectively.
The calibration pattern adjusts the window according to its size as follows:


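$$\left(x - \frac{x_n w}{s_n},\; y - \frac{y_n h}{s_n},\; \frac{w}{s_n},\; \frac{h}{s_n}\right) \tag{11}$$


In the present study, the number of patterns is N = 45, such that:


$$s_n \in \{0.87, 0.95, 1.2, 1.13, 1.25\}$$
$$x_n \in \{-0.19, 0, 0.19\}$$
$$y_n \in \{-0.19, 0, 0.19\}$$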


The image is cropped to the detection window, of size 12x12, which serves as
the input image to the 12-calibration-net. In this CNN, the results of the patterns are averaged because the
patterns obtained as output are not orthogonal. A threshold t is used to remove the
patterns that are not confident.
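
The calibration step can be sketched as follows. The pattern values are those listed above, and the 45 patterns are formed as all combinations of the five scales and the three offsets in each direction. The calibration net itself is replaced by a list of per-pattern confidence scores, which is a placeholder assumption; the function name is hypothetical.

```python
from itertools import product

SCALES = (0.87, 0.95, 1.2, 1.13, 1.25)  # s_n values as listed in the text
OFFSETS = (-0.19, 0.0, 0.19)            # x_n and y_n values as listed in the text
PATTERNS = list(product(SCALES, OFFSETS, OFFSETS))  # N = 5 * 3 * 3 = 45

def calibrate(window, scores, t):
    """Average the patterns whose score exceeds t, then apply Eq. (11).

    window : (x, y, w, h)
    scores : one confidence score per pattern (placeholder for the net output)
    """
    x, y, w, h = window
    confident = [p for p, c in zip(PATTERNS, scores) if c > t]
    if not confident:
        return window  # no confident pattern: leave the window unchanged
    s_n = sum(p[0] for p in confident) / len(confident)
    x_n = sum(p[1] for p in confident) / len(confident)
    y_n = sum(p[2] for p in confident) / len(confident)
    return (x - x_n * w / s_n, y - y_n * h / s_n, w / s_n, h / s_n)
```

Averaging the confident patterns, rather than taking only the best one, reflects the observation above that the patterns are not orthogonal.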
c. 24-net CNN 

In order to further reduce the number of detection windows, a binary classification
CNN called the 24-net CNN is used. The detection windows that remain after the 12-calibration-net are
resized to 24x24 images and re-evaluated using the 24-net. A multi-resolution structure is also
adopted in this CNN; with it, the overhead of the 12-net CNN structure
is reduced and the overall structure becomes more discriminative.

d. 24-Calibration-net CNN 

It is another calibration CNN, similar to the 12-calibration-net. The number of
calibration patterns is again N, and the calibration process is the same as that of the 12-calibration-net.
e. 48-net CNN 

It is the most effective CNN, applied after the 24-calibration-net, but it is considerably slower. It follows the same
procedure as the 24-net. The procedure used in this CNN is more complicated than that of the other CNN
sub-structures. It also adopts the multi-resolution technique, as in the 24-net.

f. 48-calibration-net CNN 

It is the last stage, or sub-structure, of the CNN. The number of calibration patterns used is the same as in
the 12-calibration-net, i.e., N = 45. In order to obtain more accurate calibration, a pooling layer is used in
this CNN sub-structure.


3. RESULT AND DISCUSSION 

Examples of the input images for two different identities, with the generated pose-invariant output
results, are illustrated in Figure 3. In this figure, the detected faces for various angles and poses, covering left and
right profile faces as well as the frontal face, are shown. Our detector gives results for images with varying
poses and resolutions. The performance of modern face detection solutions on multi-view face data sets is
unsatisfactory. It was observed that, with multi-resolution in the CNN, as shown in
Figure 5, the number of false detections levels off (at about 10000 falsely detected faces) and the
detection rate is achieved.
However, without the use of multi-resolution in the CNN, more faces are falsely detected
than with multi-resolution, as shown in Figure 4.


Figure 3. Pose invariant face detected images; 
(a), (b), (c) and (d) are right profile faces; (e) Frontal Face; (f) Left up profile face and 
(g), (h) Right profile faces 


3.1. Comparison of Face Detectors 

The effectiveness of the developed method is compared with existing methods and
techniques. The proposed method performs well in terms of accuracy and recognition rate.
We compare our method with other approaches, including EdgeBox [23], Faceness [24], and DeepBox [25],
on the AFLW data set. Our method first scans the input image at low resolution to quickly reject non-face
regions, which allows accurate detection. The use of calibration nets in the cascade improves the quality of the bounding
boxes. Meanwhile, we show that our detector can easily be tuned into a faster version with a minor performance
decrease. Without multi-resolution in the CNN, more faces are falsely detected than
with multi-resolution, as shown in Figure 4.


[Plot legend: detection rate with multi-resolution; detection rate without multi-resolution]

Figure 4. Detection rate with multi-resolution in 24-net CNN


The overall test sequence is shown in Figure 5. First, the test image is applied to the system; the 12-net
structure scans the whole image and quickly rejects about 90% of the detection windows. The remaining
detection windows are processed by the 24-net and calibration CNNs. In the subsequent stages, highly overlapped
windows are removed. Then the 48-net takes the remaining detection windows, evaluates them with calibrated
bounding boxes, and outputs the detected bounding boxes. Figure 6 shows all detection stages for the
different structure stages.
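
The test sequence described above can be sketched as a chain of filters followed by the removal of highly overlapped windows. In this sketch the CNN stages are represented by arbitrary scoring functions, and overlap is measured by intersection over union (IoU) with a threshold of 0.5; both choices, and all function names, are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring boxes and drop boxes that overlap them too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= thresh for k in keep):
            keep.append(i)
    return [boxes[i] for i in keep]

def run_cascade(windows, stages):
    """Pass windows through successive (score_fn, threshold) stages.

    Each stage rejects the windows it scores below its threshold, so later,
    slower stages only see the survivors of the earlier, faster ones.
    """
    for score_fn, threshold in stages:
        windows = [w for w in windows if score_fn(w) >= threshold]
    return windows
```

Placing the cheapest stage first is what makes the cascade fast: most background windows are discarded before the more expensive stages are evaluated.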


Figure 5. Detection results: (a) Original image given for detection, (b) Image at preprocessing stage 
(c) Detected face position with CNN 


Figure 6. Detection results for different CNN structure: (a) Input/test image, (b) Image after 12-net CNN, 
(c) Image after 24-net CNN, (d) Image after 48-net CNN, (e) Output face detected image 


4. CONCLUSION 

In this work, we developed an algorithm for detecting multi-view faces using a deep convolutional neural
network. A major contribution of this research is a procedure
that can assemble a large dataset with little label noise while reducing the amount of
manual annotation involved. The main concept of the algorithm is to leverage the high capacity of the DCNN for
classification and feature extraction, in order to learn a single classifier for detecting faces from different views and
to reduce the computational complexity by simplifying the detector architecture. For this work, we first transformed
the fully connected layers into convolutional ones by reshaping the layer parameters. By exploring a
few key features of the network structure, we achieve high-performance convolutional networks at a
relatively small scale. Our detector gives results for images with varying poses and resolutions.